Applied Machine Learning (with Python)

What are we going to learn?

Today…

  1. Finding and exploring a dataset
  2. Making our first predictions
  3. How did we do?
  4. Applying Machine Learning
  5. Where do we go from here?

Data Science and AI, wot?

Finding our data

The Kaggle Titanic Dataset

… now we can begin exploring!

Practical 1 - Understanding our data

The Unreasonable Effectiveness of Linear Regression

But we’re not restricted to one factor!

results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.466
Model:                            OLS   Adj. R-squared:                  0.460
Method:                 Least Squares   F-statistic:                     74.32
Date:                Thu, 15 Jun 2023   Prob (F-statistic):           3.16e-13
Time:                        14:14:59   Log-Likelihood:                -108.49
No. Observations:                  87   AIC:                             221.0
Df Residuals:                      85   BIC:                             225.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0480      0.226      4.630      0.000       0.598       1.498
x1             0.0989      0.011      8.621      0.000       0.076       0.122
==============================================================================
Omnibus:                        7.712   Durbin-Watson:                   2.161
Prob(Omnibus):                  0.021   Jarque-Bera (JB):               14.664
Skew:                          -0.127   Prob(JB):                     0.000654
Kurtosis:                       4.995   Cond. No.                         49.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What are we doing?

  • Really, we’re just fitting a line

  • \(y_i = \beta_0 + \kappa T_i + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + u_i\)

  • But that line can get super, super squiggly

  • Machine learning is just using compute to make the best multidimensional squiggle
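Fitting that line takes only a couple of calls in scikit-learn. A minimal sketch on synthetic data (not the Titanic set), chosen so the true intercept and slope are known in advance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y = 1 + 0.1 * x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(87, 1))
y = 1.0 + 0.1 * X[:, 0] + rng.normal(0, 0.3, size=87)

# Estimate beta_0 (intercept_) and beta_1 (coef_) by least squares
model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_[0])
```

With more columns in `X`, the same two lines fit the full multi-factor version of the equation above.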

From Stats to ML

Testing and Validating

  • Traditional statistics often predicted “in sample”
  • Instead, we evaluate on held-out test data (which is way easier)
  • So think really hard about:
    • whether your test set is meaningful
    • what baseline performance is
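One way to pin down baseline performance is scikit-learn's `DummyClassifier`, which ignores the features entirely. A sketch using the built-in iris data as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A "most frequent class" baseline: any real model should beat this
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))
```

If your trained model barely beats this score, the features may not be adding much.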

Metrics that matter

Metrics Practical

Linear Models Recap

  • Linear models are generally excellent places to start
  • Use them to think about your data: how it’s built, how it’s connected, and what you’re aiming to achieve
  • With a bit of work, these can perform shockingly well
  • But now it’s time to go deeper

Into the forest of ML

Scikit-learn: your big ML toolbox

The sklearn paradigm

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
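The last step of the paradigm is scoring those predictions against the held-out labels. A self-contained sketch of the same fit with an accuracy check added:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

# Compare predictions against labels the model never saw
print(accuracy_score(y_test, y_pred))
```

Every sklearn estimator follows this same fit/predict/score shape, which is what makes swapping models so painless.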

Which Model for Which Problem (and why it’s always XGBoost)

Gradient Descent - When Lines get real squiggly

So how do we fit the super squiggly line?
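A toy sketch of the core idea, on a straight line rather than a squiggle: repeatedly nudge the parameters downhill along the gradient of the mean squared error. The data here is synthetic, with a known true slope and intercept:

```python
import numpy as np

# Synthetic data: y = 2x + 0.5 + noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 200)

w, b = 0.0, 0.0      # slope and intercept, initialised at zero
lr = 0.1             # learning rate (step size)
for _ in range(500):
    y_hat = w * x + b
    # Gradients of mean squared error with respect to w and b
    grad_w = -2 * np.mean((y - y_hat) * x)
    grad_b = -2 * np.mean(y - y_hat)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach the true values, 2.0 and 0.5
```

Neural networks do exactly this, just with millions of parameters and a much squigglier function.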

Gradient, Loss and Boosting

  • Neural network approaches rely on gradient descent to minimise their loss (i.e. maximise how well they fit)
  • Gradient boosters rely on combining (or “boosting”) models
  • Combining your forecasts can be hugely effective
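Gradient boosting is built into scikit-learn, so you can try it with the same fit/score paradigm as before. A minimal sketch on the iris data as a stand-in (XGBoost exposes an almost identical interface):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Each new tree is fit to the gradient of the loss of the ensemble so far
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))
```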

Tous Ensemble!

Combining Models

  • Boosting is only one approach
  • Just like Random Forests, the best models are often combinations
  • Let’s implement it!
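One straightforward way to combine models in scikit-learn is a `VotingClassifier`, which takes a majority vote across quite different model families. A sketch, again using iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Majority vote over three different model families
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```

Because the three models make different kinds of mistakes, the vote often cancels out individual errors.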

Risks and Benefits to Ensemble

  • Has your model performed better?
  • What is the implication for fit and variance?

Optional Materials

Class Imbalance

  • How imbalanced is our data?
  • What impact do you think this has in reality?
  • There are a range of approaches:
    • Weighted Errors
    • Bespoke algorithms (SMOTE etc)
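The weighted-errors approach is the cheapest to try: most sklearn classifiers accept `class_weight="balanced"`, which upweights mistakes on the rare class. A sketch on a synthetic imbalanced problem (roughly 90/10):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~90% of samples in one class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# class_weight="balanced" scales errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Recall on the rare class is usually the metric that suffers under imbalance
rec = recall_score(y_test, clf.predict(X_test))
print(rec)
```

Compare the same model with and without `class_weight` to see the trade-off against overall accuracy.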

Where do we go from here?

Text and NLP

  • We haven’t used the name category at all
  • How would you extract value from it?
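One simple starting point: the honorific buried in each name (Mr, Mrs, Miss, Master, …) carries real signal about age and sex. A sketch with pandas on a few example Titanic names:

```python
import pandas as pd

# A small slice of the Titanic "Name" column, for illustration
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Extract the honorific between the comma and the next full stop
titles = names.str.extract(r",\s*([^.]+)\.")[0]
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The extracted titles can then be one-hot encoded and fed to any of the models above as a new feature.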

Image Recognition

  • Thinking beyond tabular data, how could what you’ve learnt be applied to images?

Helpful Resources

Come to our hackathons!

Thanks!

  • avarotsis@no10.gov.uk

  • Andreas Varotsis_10DS @ GovDataScience Slack

  • andreasthinks@twitter.com

  • andreasthinks@fosstodon.org

  • Andreas Varotsis @ Kaggle